Library Imports
```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```
Template
```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Exploring Joins")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext
```
Initial Datasets
```python
pets = spark.createDataFrame(
    [
        (1, 1, 'Bear', 5),
        (2, 1, 'Chewie', 10),
        (3, 2, 'Roger', 15),
    ], ['id', 'breed_id', 'nickname', 'age']
)

pets.toPandas()
```
|   | id | breed_id | nickname | age |
|---|----|----------|----------|-----|
| 0 | 1  | 1        | Bear     | 5   |
| 1 | 2  | 1        | Chewie   | 10  |
| 2 | 3  | 2        | Roger    | 15  |
```python
groupby_columns = ['breed_id']
```
Option 1: Using a Dictionary
```python
df_1 = (
    pets
    .groupby(groupby_columns)
    .agg({
        "*": "count",
        "age": "sum",
    })
)

df_1.toPandas()
```
|   | breed_id | count(1) | sum(age) |
|---|----------|----------|----------|
| 0 | 1        | 2        | 15       |
| 1 | 2        | 1        | 15       |
What Happened:

- Very similar to the `pandas` `agg` function.
- The resultant column names are a bit awkward to use after the fact (see the sketch below).
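To make that awkwardness concrete, here is a minimal sketch (a hypothetical follow-up step, not part of the original notebook) that reuses the `df_1` above; the auto-generated names have to be spelled out exactly, parentheses and all:

```python
# Hypothetical follow-up: keep breeds that appear more than once.
# The auto-generated name "count(1)" must be quoted verbatim.
breeds_with_multiple_pets = (
    df_1
    .filter(df_1["count(1)"] > 1)
    .withColumnRenamed("sum(age)", "total_age_of_breeds")  # renaming after the fact
)

breeds_with_multiple_pets.toPandas()
```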
Option 2: Using List of Columns
```python
df_2 = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*"),
        F.sum("age"),
    )
)

df_2.toPandas()
```
|   | breed_id | count(1) | sum(age) |
|---|----------|----------|----------|
| 0 | 1        | 2        | 15       |
| 1 | 2        | 1        | 15       |
What Happened:

- Here we use the Spark `agg` functions.
- Again, the resultant column names are a bit awkward to use after the fact (the sketch below checks the generated names).
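As a quick check (a small sketch assuming the `df_2` above), the generated names can be inspected directly; `toDF` can rename them all positionally afterwards, but the names then live far from the aggregates that produce them:

```python
# Sketch: inspect the auto-generated column names from df_2.
print(df_2.columns)  # ['breed_id', 'count(1)', 'sum(age)']

# One after-the-fact fix: rename every column positionally.
df_2_renamed = df_2.toDF("breed_id", "count_of_breeds", "total_age_of_breeds")
print(df_2_renamed.columns)  # ['breed_id', 'count_of_breeds', 'total_age_of_breeds']
```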
Option 3: Using List of Columns, with Aliases
```python
df_3 = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*").alias("count_of_breeds"),
        F.sum("age").alias("total_age_of_breeds"),
    )
)

df_3.toPandas()
```
|   | breed_id | count_of_breeds | total_age_of_breeds |
|---|----------|-----------------|---------------------|
| 0 | 1        | 2               | 15                  |
| 1 | 2        | 1               | 15                  |
What Happened:

- Here we use the Spark `agg` functions and `alias`ed the resultant columns to new names.
- This provides cleaner column names that we can use later on (as in the sketch below).
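Because the aliases are ordinary column names, later steps can refer to them directly. A small sketch (a hypothetical follow-up, not part of the original) using the `df_3` above:

```python
# Sketch: the aliased names read naturally in downstream transformations.
avg_age_per_breed = (
    df_3
    .withColumn(
        "avg_age_of_breeds",
        F.col("total_age_of_breeds") / F.col("count_of_breeds"),
    )
    .orderBy(F.col("avg_age_of_breeds").desc())
)

avg_age_per_breed.toPandas()
```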
Summary
I encourage using option #3.

This creates more elegant and meaningful names for the new aggregate columns. A `withColumnRenamed` can be performed after the aggregates, but why not do it with an `alias`? It's easier as well.
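For comparison, here is a rough side-by-side of the two approaches (a sketch built on the `pets` DataFrame above); both end up with the same column names, but the `alias` version declares each name right next to the aggregate that creates it:

```python
# Rename after aggregating: the names are declared far from the aggregates.
df_renamed = (
    pets
    .groupby(groupby_columns)
    .agg(F.count("*"), F.sum("age"))
    .withColumnRenamed("count(1)", "count_of_breeds")
    .withColumnRenamed("sum(age)", "total_age_of_breeds")
)

# Alias inline (option #3): each name sits next to its aggregate.
df_aliased = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*").alias("count_of_breeds"),
        F.sum("age").alias("total_age_of_breeds"),
    )
)

assert df_renamed.columns == df_aliased.columns
```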